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Abstract 

We consider the problem of estimating a sparse high-dimensional normal mean 
vector using an empirical Bayes approach. Our proposed model features a unique 
form of shrinkage, and has strong finite-sample performance. We also prove, un- 
der certain conditions, that our empirical Bayes posterior concentrates on balls, 
centered at the true mean, with squared radius proportional to the frequentist min- 
imax rate for the given sparsity class. Asymptotic minimaxity of the corresponding 
empirical Bayes posterior mean is also shown. 

Keywords and phrases: Data-dependent prior; high-dimensional; fractional like- 
lihood; posterior concentration; shrinkage; two-groups model. 
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1 Introduction 

High- dimensional problems, where the parameter is effectively lower-dimensional, are 
now commonplace in statistical applic ations. Common examples incl ude variable se- 



Cai and Zhou 2012: Lam and Fan 



lection in regression flFan and Lv 120101) . covariance matrix estimati on fICai et al.l 2010 



20091 



1), larg e-scale mu l tiple testing (Bogdan et al.ll2011 



Cai and Jinll2010l ). and function estimation ( Call 120121 : Ijohnstone and Silvermanl l2005l ) 



The canonical example, which we shall consider here, is that of estimating a sparse 
normal mean. Let Xi, . . . ,Xn be independent observations, with Xi ~ N(^j,l), i = 
1, . . . ,n, and the goal is to estimate the mean vector 9 = (9^ ^ . . . , 9„) under squared- 



error loss H^' — ^||^, where || • J | is the usual £^-norm on R" (e.g., lAbramovich et al.l 12006 



Brown and GreenshteinI l2009l : ICastillo and van der VaartI 120121 : iDonoho and Johnstone 



19941 : iDonoho et al.lll992l : I Jiang and Zhang l2009f ). With only a single observation Xi for 
each 9i, accurate estimation is not possible without some structure. Assuming 9 is sparse, 
i.e., most ^j's are zero, makes the effective dimension relatively small and good estimation 
becomes possible. 

When the normal mean is assumed to be sparse, a variety of estimation methods are 
available. Thresholding estimates are popular, perhaps because of their simplicity and 
intuition. The universal hard thresholding rule, for example, estimates 6i with 6i{t) = 
Xjl{|x,|>j}. That is, if the magnitude of Xi is large, then 9i is likely non-zero, so use the 
optimal component-wise estimator Xf, on the other hand, if the magnitude o i Xj is small, 
then 6i is likely zero, so set 6i = 0. For suitable t, iDonoho and Johnstone! (119941 ) have 
shown that 6{t) is asymptotic ally minimax as n —y oo. More sophisticated thresholding 



schemes are also available; see lAbramovich et al.l (120061 ). 



Re cently, this problem has been considered from a Bayesian perspective in lCastillo and van der Vaari 
(120 12l ). They focus on frequentist properties of a Bayesian posterior distribution, and the 
corresponding Bayes estimators, for so-called two-groups priors. In sparse estimation 
problems, a two-groups prior puts positive probability on 6 vectors with some exact zero 
entries, so the marginal prior for e ach component is a mixtur e of a continuous distri- 



bution and a point-mass at zero. ICastillo and van der VaartI (120121 ) show that, for a 



suitably chosen two-groups prior, the posterior concentrates around the true signal at 
the asymptotically optimal minimax rate. From this, concentration properties of pos- 
te rior quantities, such as the poste rior mean, can be derived. An important message 
in ICastillo and van der VaartI (120121 ) is that care is needed in choosing the prior for the 
non-zero 6 entries. In particular, they show that priors with too light tails, e.g., Gaussian, 
give sub-optimal concentration properties. The results presented herein provide similar 
guidance, though our perspective is quite different. 

Here we take a novel empirical Bayes approach. In particular, we present a hierarchical 
two-groups prior where, given a weight u, the 6'j's are modeled as independent, with 9i = 
with probability gi{u), and 9i ~ /zi(6' | u) with probability 1 — gi{uj), where the functions 
Qi and hi depend on data Xi. These functions are defined explicitly in Section [2l To 
complete the hierarchy, u is assigned a prior concentrated near 1. We argue that the 
effect of the data- dependent prior is mitigated by preventing the posterior from tracking 
the data too closely. This approach provides some ne w insights, which we compare wit h 
those coming from the fully Bayesian framework of ICastillo and van der VaartI (120121 ). 
However, our method relies on a very simple Gibbs sampler which is facilitated by the 
direct use of the u parameter. 

In Section [3] we present our theoretical framework and show that the empirical Bayes 
posterior distribution concentrates around the true mean vector at the optimal minimax 
rate (with respect to square error loss and the degree of sparsity) with probability 1. 
Concentration rate theorems for empirical Bayes posteriors are relatively scarce in the 
literature, and our technique for handling the challenges that arise from data appearing 
in both the likelihood and prior might be useful in other problems. We go on to show that 
the corresponding empirical Bayes posterior mean, as an estimator of 6, is asymptotically 
minimax. An interesting observation is that the particular form of the prior on u is 
effectively driving the concentration of the empirical Bayes posterior distribution. 

Section Hj describes computation of our empirical Bayes posterior mean via a straight- 
forward Markov chain Monte Carlo. We contend that the relative simplicity of the pro- 



posed method compared with that in, say, ICastillo and van der VaartI fl2012l ). is due to 



our focus on the weight parameter u rather than the support size. Simulation results are 
presented to show that our empirical Bayes posterior mean generally outperforms a vari- 
ety of Bayesian and non-Bayesian competitors across a number of settings. In particular, 
we compare our meth od with several o f those derived from one-group prior Bayes mod- 



els, such as the lasso flTibshiranilll996l ). the Bayesian lasso ( iPark and Casellal 120081). the 
horse shoe f Carvalho et al.ll2010l ). and the recent Dirichlet-Laplace f Bhattacharva et al. 



2OI2I ). For relatively strong signals, our empirical Bayes estimator clearly outperforms 
these competitors. So, while there is a recent trend towards these kinds of one-group 
models, no rates have been found to date and, moreover, our method with a demon- 
strably optimal rate appears to have better finite-sample performance. Some concluding 
remarks are given in Section [51 

2 An empirical Bayes model 

For the independent normal mean model, Xi ~ N(6'j, 1), i = 1, . . . ,n,let pg- {xi) denote the 
density oi Xi, and, for x = (xi, . . . ,x„), let Pg{x) = YYi=iPeii^i) denote the corresponding 
joint density oi X = {Xi, . . . ,X„). Define a data-dependent hierarchical prior Ux for 
6 = {61, ... , 9n) as follows. Introduce a weight parameter uj G (0, 1), and take the joint 
prior distribution for (6'i, . . . , 6'„, w), under IIx, to have density proportional to 



cu""- 






where a > 0, k G (0, 1), and a^ > are parameters to be discussed in more detail in 
Sections [3H4] below. A representation of this prior as a genuine empirical Bayes plug-in 
prior is given in Section [XT] The dependence of the prior IIx on {a, k, a^) shall be hidden 
in our notation. 

Observe that if cr^ < j^, then the prior for 9i is proper, a mixture of a point mass 
and a Gaussian centered at X,. When a^ > j^, the prior is improper. In any case, the 
posterior is proper, so this possible impropriety of the prior is not a problem. In fact, 
cr^ = Y~ is a critical boundary, corresponding to an improper uniform distribution for 
the non-zero 6'j's; see Section [3l3l The term cu"'""^ in the joint density, which resembles 
a beta density, turns out to be critical to the success of our proposed method, both in 
theory and in implementation. 

Given data X = {Xi, . . . , Xn) from the normal mean model and the empirical Bayes 
prior distribution IIx for 6, we could combine these to form an empirical Bayes posterior 
distribution via Bayes theorem. That is, for a suitable set A in the ^-space, define the 
probability measure 

Qn{A) = Qn,x{^) « / P^eiX) Uxide). 

J A 

We will investigate concentration properties of the empirical Bayes posterior in Section [31 
In particular, we show that the empirical Bayes posterior mean derived from Qn is an 
asymptotically minimax estimator of 9. 

It might seem that our apparent double-use of the data — in the prior and in the 
likelihood — could lead to a posterior Qn that tracks the data too closely. To see that 



this is not the case, note that if \Xi\ is large, then the prior probabihty for 6i = 0, under 
Ux, would be rather large. Thus, the prior has an unexpected shrinkage effect, pushing 
6i corresponding to Xi with large magnitude towards zero. On the other hand, an Xi 
with large magnitude shifts the prior on the non-zero part further from zero, effectively 
making the tails heavier, to accommodate large signals. These two phenomena suggest 
that using data in both the prior and the likelihood will not result in a posterior that 
tracks data too closely. In fact, our theoretical and numerical results demonstrate that 
the posterior is doing the right thing, namely, concentrating on the true 6. 

3 Empirical Bayes posterior asymptotics 

3.1 Fractional likelihood as a mathematical device 

To start, it will help to look at the proposed model from a different perspective. For 
mathematical convenience, we shift our focus and rewrite the empirical Bayes posterior 
Qn using a fractional likelihood. That is, we write Pq{X) = pg{X)'^pQ{XY~'^ and move 
the 1 — K fraction into the prior IIx defined above. The effect of this is an alternative 
prior for {6, u) of a very simple form: 

9i\uj'^ ujSo + {l-uj)M{Xi,a'^), i = l,...,n, 
uj ~ Beta (an, 1). 

To provide some further intuition for the prior ([1]) presented in Section [21 we may consider 
a data-free version of the prior in ([2]), where the Xj's are replaced by hyperparameters 
/ij. Then ([2]) can be understood by thinking of X^ as the maximum marginal likelihood 
estimate of /ij. Then the prior ([T]) is that obtained by undoing the fractional likelihood. 
Within this alternative perspective, we introduce independent binary latent variables 
Ii, . . . ,In, where /« = 1 if and only if 9i = 0. Then, given oj, the indicators Ji, . . . , /„ 
are independent Ber(u;) variables. These indicators characterize the support of the vector 
9; in particular, X]r=i(-'- ~ -^*) ^^ ^^^ number of non-zero 9i, or the support size, and is 
distributed as Bin(n, 1 — u). The beta prior for uj is concentrated near 1 for n large, 
so the support size will tend to be small, consistent with the assumption of sparsity. 



Castillo and van der VaartI (l2012l ). on the other hand, focus primarily on priors directly 
on the support size, though this kind of beta-binomial prior is considered in their Ex- 
ample 2.2. But the direct use of the weight u is both theoretically and computationally 
convenient; see Remark [TJ 

Write this new version of the prior as IIx, and express the posterior as 



Q„(A)oc f p^{XrUx{d9). 

J A 



This version of t he empirical Bayes poster ior is particular amenable for our asymptotic 



analysis; see, also lWalker and HjortI (120011 ). 



3.2 Lower bound on the denominator 

In the normal mean model, let 9* denote the true mean vector. Assume that 9* is 
sparse in the sense that most of its entries are zero. To make this more precise, let 



S* C {1,2, ... ,n} denote the support of 6*, i.e., 6* 7^ if and only if z G S*. Let 
Sn = H^S* be the cardinahty of S* , and say that 9* is s„-sparse. Then by sparse we 
mean that s^ — )■ cxo but Sn = o{n) as n — )■ 00. That is, ahhough 6* is n-dimensional, its 
effective dimension is actually much smaller. 

Start by rewriting the empirical Bayes posterior Qn once more as 



Qn{A) 



JAPeix)/pUx)rux{de) 



(3) 



Our overall goal is to show that Qn concentrates its mass near 6* with P^-probability 1. 
The strategy is to show that the denominator of Qn is not too small, and the numer- 
ator, for sets An away from 6*, is not too large. In this subsection, we will bound the 
denominator for all {a, k, a^) choices. 

Our first result gives a bound on the denominator of Q n , like that which derives 



from the Kullback-Leibler property (e.g.. iBarron et al.lll999t iGhosal et al.l Il999l 12000 



Schwartzlll965l : IShen and Wassermanll200ll ). This lower bound will be used in Section |3] 
to derive vanishing upper bounds on the Qn-probability assigned to complements of balls 
around 6*. But besides as a tool for proving other things, the following lemma suggests 
that our empirical Bayes-s t yle pr ior i s sufficiently concentrated around 6*. That is, as 
Castillo and van der VaartI (120 12[ ) and iBhattacharya et al.l ( 120121 ) show, without suitable 
prior concentration, the desired posterior concentration is not possible. Therefore, if we 
associate lower bounds on the denominator of Q„ in ([3]) with adequate prior concentration, 
then Lemma [1] says that our prior is sufficiently concentrated around 6*. 

Lemma 1. Let Dn be the denominator in ([3]). If 6* is Sn- sparse, then there exists 77 G M, 
depending on {K,,a,a'^), such that Dn > j^^wiv^n — 2s„ log(n/s,i) + o(s„)} with P^- 
probability 1. 

Proof. Write Dn in terms of the conditional prior {9i, . . . , On) \ w ~ Llx.aj and the marginal 
prior a; ~ vf for cj under fix- That is. 



Dr 



'^Y'^xAdO)^{duj) 



1 n 

n 

4 = 1 



Peti^i 



Ux,um)Hdu) 



For given u, the inner expectation involves an average over all configurations of the 
indicators (/i,...,J„) defined in Section [XT] This average is clearly larger than just 
the case where the indicators exactly match up with the support S* of 9*, times the 
probability of that configuration. That is. 



'0 



X 



n 



62 



{{x,-e*)^-{Xi-e,,f}_ 



V2na^ 



e 2<T 



^(x,-eif 



dOi, 



The term a;*"~"(l— cu)*" corresponds to the probability for the configuration of (/i, . . . , J„) 
matching the support S* . The integral for z G 5* is the expectation of the normal density 
ratio for non-zero 9i with respect to the N(Xj, a^) prior. Finally, the product over i ^ S* 



disappears because po{Xi) = pg*{Xi) for i ^ S*. To further bound this quantity, first pull 
out the terms exp{|(Xj — 6*,*)^} in the latter integrand that do not depend on 6i. Since, 
by the law of large numbers, s~^ X]je5*(^« ^ ^i^ — ^ 1, as n — )■ oo, with P^-probability 1, 
this part contributes a factor exp{|s„ + Oa_,sXsn)} to the lower bound for D„. Next, some 
straightforward calculus gives 



e 2 



{x,-ei)^ 



v27rcr' 



-.e 2,7 



^(Si-Xi)^ 



dOi 



1 



;i + ^(^2)1/2 ■ 



So, the remaining product over i E S* equals (1 + ko"^) *"/^, and we can conclude that 
the entire product over i G 5* in the lower bound for D„ is itself lower bounded by 



exp 



s„A- - log(l + Kcr^ 



+ 0, 



a.s. \'^n) 



It remains to bound the first integral over uj. Since 7i{du!) = anu 
constant a > 0, we have 



an—l 



du, for some 



w" """(l - uY" TT{du) > an 
Jo 



l-Sn/n 



> I — 

n 



u 
an 



n—Sn+an—l , 



uf" du 



1- — 
n — Sn + an\ n 



g \ n—Sn+an 



> 



a 



a \ 2s„ 
■Jn 



1 



(4) 



I + a\ n / \ n 
The last inequality follows since (1 — 6)^"^ > b^ for small b > 0. Next, if we write 

(1 - Sn/ny = exp[-an{-log(l - s„/n)}], 

and use the approximation — log(l — x) = x + o{x), for x ~ 0, then we get a lower bound 
on the w-integral of the form: 

cexp{-2s„log(?i/s„,) - asn + o{sn)} , n-)- oo, 

for c = tt/(l + a) > 0. Putting these pieces together, gives the lower bound 



Dn > cexp 






2Sn\og{n/Sn) 



Set ?7 = I — a — log(l + ko"^) G M to complete the proof. 



D 



3.3 Concentration 

In the frequentist problem of estimating a s„-sparse vector 6 under f^-e rror loss, it is 
known that the m inimax rate is proportiona l to Sr,. := s„ log (n/s„); see iDonoho et al. 



fll992l ) . Following ICastillo and van der VaartI ( 120121 ) , our goal here is to show that Qn 
concentrates asymptotically on n-balls, centered at 9*, with square radius proportional 
to £„. That is, for a constant M > 0, let 



A 



Men 



{6 ew : \\e-e^\Y > Men}. 




Figure 1: Portion of the feasible region i?^ in (^, with /3 = 200, for (k, a^). Sohd hne 
corresponds to the curve a^ = j^. 



The theorem below requires a restriction on (k, a^). In particular, we require that, 
for some /3 > 1, (k, a^) reside in the feasible region 



Rf- 



K, a 



1 k[{1-k)/3-1] 



(5) 



a2(l + ^/a2)V/3 a2 + /3 ~ /3 - 1 

We are particularly interested in large /3, so that k, arbitrarily close to 1 can be included. 
Figure [1] displays a portion of the region i?^, for /3 = 200. Interestingly, the condition 
cr2 = YT^ discussed in Section [2] defines the boundary of /2^, for large /3 and n ^ 1. In 
our experience, we have found that k ^ 1 and o"^ = _1_ performs well; see Section HI 

Theorem 1. For any fixed /3 > 1, take (k, cr^) in the feasible set Rp. If 9* is Sn-sparse, 
then there exists M > such that Qn(^A/e„) — ^ with P"^ -probability 1. 

Proof. Let Nn be the numerator for Qn{^Me,^) in ©; i-e.. 



Nr, 



JA 



\{{'^Y^x.Aded^{du)- 



Men 4=1 



^(^. 



If we take expectation of N^, with respect to P^*, under which Xi, . . . , X^ are indepen- 
dent, we can see that 



i.e.(Nn 



.,. n 



Pe*[Xi 



Il^^,cuid6i)pe*{xi) dxi 7r{du). 



Write Juj{d9i) for the measure defined in the z-th product term. Split this into discrete 
and continuous pieces: 



Mde,) 



u 



PeAxi)V 



Iix„uj{dSi)Pe*{xi)dxi 
Pe*{xi)dxi\6o{d6i) 



Pe:{xi) 

+ (l-c^) 



PeXxi)YPe,/Axi/cr) 



Pet[Xi) 



(xi) dxi \ dOi 



a 



For clarity, we shall work with the discrete and continuous separately. 

Discrete part. Using the Renyi divergence formula for normal distributions, the dis- 
crete term simplifies to cuexpj — '^^ ~'^' {9i — 6*)"^} 6o{d9i). 

Continuous part. An application of Holder's inequality, with coefficients -g^j and (3, 
whose reciprocals sum to one, gives 

Pe-{xi)J a 



Pet[Xi)y J ^j \ a 



For (k, a"^) E R13, we have j^ < 1. Then the same Renyi divergence formula used above, 

applied to the first term in the upper bound, gives exp{ — 1 ^~^^~ {9i — 0*^}. For the 
second term in the upper bound, direct integration, or the non-central chi-square moment 
generating function, gives 

After some tedious algebra, this can be rewritten as 

Combining the two terms in the upper bound, while ignoring the normal density, gives 

h - et 



lrK[(l-K)/3-l] , . - M,. n.^2 



For (k, 0"^) in the feasible region i?^ in ([5]), the coefficient on {6i — 9*Y iii the exponential 
term above is negative. 

We can now find a constant c > 0, depending on (k, a^), such that 

Then J^{d9) := YYi=i Ju){d9i) is upper bounded by exp{— ci||6' — 6'*|p} times a probability 
measure in 6 on M". Therefore, by definition of AMe„, 



Jo Jaa,^^ 



'0 JA 

Next, take M such that cM > 2, and then take K G (2, cM). Then Markov's inequality 
gives the upper bound 

P^.(Ar„ > e"^"") < Le-("^*^-^^)"". 

This upper bound has a finite sum over n > 1, so the Borel-Cantelli lemma gives that 
Nn < e"^^", with P^-probability 1 for all large n. Putting together this bound on A^^ 
and the one on D„ from Lemma [1], we get 

Dn ~ a 

Since s„ = o(e„), the exponent diverges to —00 regardless of the sign on f3. Therefore, 
Qni^Men) — ^ as n — )■ 00 with P^-probability 1. D 

8 



Remark 1. An interesting observation is that the £„, rate is driven primarily by the beta 
prior on the weight u. In particular, it comes from the term (^)2s„ ^^ ^^le lower bound 
dl]) in Lemma [H This means that the prior for 6, given u, should be selected so that it 
does not interfere with the correct rate coming f rom the lower bound on the denominator 



of Qn- A similar message can be taken from ICastillo and van der VaartI (J2012l ). But 
given the relative simplicity of our derivations, the easy implementation, and the strong 
finite-sample performance, we would argue that focusing on the prior for a; is a good 
choice. 



Remark 2. ICastillo and van der VaartI (120121 ) show that the minimax concentration rate 



will not hold if the prior on non-zero 6 has too light tails, e.g., Gaussian; see, also. 



Bhattacharya et al.l (120121 ). A way to understand this point, from our perspective, is 
that the Gaussian conditional prior interferes with what the beta prior for the weight 
u is doing. As we have demonstrated, this does not necessarily mean that Gaussian is 
wrong, but that some adjustments should be made to prevent this interference. 

3.4 Asymptotic minimaxity of the mean 

For estimating the mean vector, the empirical Bayes posterior mean. 



en = Jeq^ide), 



is a reasonable choice. Next we show that On achieves the frequentist minimax rate for 
estimation of an s„-sparse signal 6*. 

Theorem 2. Take (k, cr^) as in TheoremUi 1/6* is Sn-sparse, then there exists M' > 
such that E0*||6'„ — 6'*|p < M'Sn for all large n. 

Proof. Start by considering the quantity / ||^ — 6**1^ Qn{d9). Split this integral into two 
via the partition M" = Am£„ U ^m£„ ^^^ ^ ^^ i^ Theorem [T] On ^m£„' 11^ ~ ^*P is 
bounded above by Msn, and (5n(^M£„) < 1 trivially. So, we immediately get 



/ \\9 - 9*f QnidO) < Men. 



For the integration over Ameui "we again look at the numerator and denominator of Qn 
separately, as in the previous subsection. The denominator has the same lower bound as 
in Lemma [H Take n large enough that, with Pe*-probability 1, the lower bound in the 
lemma holds; then the expectation of the ratio is can be bounded by upper bounding 
the expectation of the numerator, together with the lower bound on the denominator. 
Expectation of the numerator, with respect to Pg*, proceeds just like in the proof of 
Theorem [TJ This time, we get 

ll^-rf j^(de)^(dw), 

En 

where JJ^{dd) is exp(— c||^ — ^*|P) times a probability measure for 9 in M"-, just as in the 
proof of Theorem [H Since the function x i— )■ xe~^^ is monotonically decreasing for large 

9 



enough x, we can see that, for large n, \\6 — O^W^ exp{—c\\6 — 6*\\'^) < Me„exp(— cM£„) 
on Ameu- Therefore, the expectation is eventually bounded by M£„exp(— cM£„). Com- 
bining this with the lemma's lower bound, we can find z/ > such that, for large n, 



Ee* / \\9-e*\\'Q4de)<Mene-'''\ 

But pn - e*f < / 11^ - e*fQn{de) by Jensen's inequality, so Ee*\\0 - O^f < Me„(l + 
e"'"''"). Take M' = 2M to complete the proof. D 

4 Numerical results 

4.1 Computational algorithm 

Computation of the empirical Bayes posterior mean can be carried out via a simple Gibbs 
sampler for uj and 6 = (6'i, . . . , ^„). A loop of this Gibbs sampler goes as follows: 



1. For i = 1, . . . ,n, sample independently 



with prob. oc ooe 2-n 



I UQ WlLli ]J1UU. LA. U/C " 

r), With prob. oc 




2. Set D = i^{i:9i = 0} and take w | 6* ~ Beta(an + D,l + n- D). 

Repeat this loop to get a sample from the posterior of 6. Then estimate 6 by averaging 
the posterior samples obtained from the Gibbs sampler. R code for this procedure is 



available at www . math . uic . edu/~ rgmartin 



In our experience, we have found that good numerical results are obtained for large 
K and large a^. However, theory does not provide any assistance with regard to the a 
parameter, since it appears in the proofs attached only to lower-order terms. We found 
that small to moderate a works well. This is based on the simple posterior mean for u, 
given (X,6'), 

a + 1 a + 1 n 

This suggests a small is best choice, so the expectation is close to D/n which would be 
the estimate of 1 — — . If a is very large, then u will get pushed too close to 1. For 
example, for k ^ 1 and large cr^, we have found that a = | is an overall good choice. 

4.2 Simulation studies 



For il lustration, we first reproduce a simulation study presented in iBhattacharya et al 



(120 12l ). In particular, we take samples X = {Xi, . . . ,Xn) of dimension n = 200 from 
the normal mean model Xi ~ N{6*,1). Recall the sparsity level s„ is the number of 
non-zero 6*'s. In this case, we consider s„ = 10,20,40, and the signals are fixed at 
values A = 7,8. Table [1] displays estimates of the mean squared error obtained from 
100 replications of X. In addition to the proposed empirical Bayes posterior mean es- 
timator (EBM), based on feasible (k = 0.99, cr^ = 100.1) and a = 0.25, the methods 
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Table 1: Mean square e rrors, based on 100 replicat ions, sampling X of dimension n = 200. 
First six rows are from iBhattacharya et al.l ( 12012| ): last row corresponds to the proposed 
empirical Bayes posterior mean. Boldface font indicates the column winner. 



being compared are the Bayesian las so estimator (BL) o f Park and Casellal (J2008f ). the 



Dirichlet-Laplac e estimator (PL) of iBhattacharva et al.l ( 120121 ) . the ordinary lasso es- 



timator (LS) of iTibshiranil (119961) . an empirical Bayes median estimator (EBMed) of 
Johnstone and Silvermanl (20041). a fully Bayes posterior median estima tor (PMedl) of 



Castillo and van der VaartI ( l2012l ). and the horseshoe estimator (HS) of ICarvalho et al. 



( 20101 ). Here, we find that our proposed empirical Bayes estimator is the top performer 



across all these settings. In particular, the proposed estimator is better able to detect 
the larger signals. 



As a second example, we reproduce a simulation study presented in lCastillo and van der Vaart 
( I2OI2I ). In this case, we look at n = 500, s„ = 25, 50, 100, and signals fixed at A = 3, 4, 5. 



Table [2] displays estimates of the mean squared error based on 100 replications. This 
time, the methods are two fully Bayes posterior mean estimates (PMl and PM2), two fully 
Bayes component- wise posterior medians (PMedl and PMed2). I Johnstone and Silvermanl 
(I2OO4I ) empirical Bayes mean (EBM) and median (EBMed), and hard thresholding (HT) 
and hard thresholding oracle (HTO) rules. Our proposed empirical Bayes estimator is 
competitive when A = A, and clearly dominates when A = 5, just like in the previous 
illustration. Interestingly, the empirical Bayes estimators are the better performers over- 
all in this case. We expect that with more careful tuning of a, the performance of our 
empirical Bayes estimator in the A = 3 cases can be improved. 

One rather unusual observation is that some of the methods have, for given s„, a mean 
square error increasing in the signal size A. We find this behavior to be counterintuitive, 
since it should be easier to detect stronger signals. The two thresholding estimators have 
decreasing mean square error as A increases, as does our proposed estimator. 

The posterior distribution for u is also of interest. This provides some indication 
of the dimension of the space on which the empirical Bayes posterior Qn for 9 concen- 
trates. One expects that Qn w ill concentrate on a space of dimension s„ <^ n (e.g., 
Castillo and van der VaartI l2012l . Theorem 2.1). In our case, the posterior distribution 
for u is concentrating around 1 — ^^ for some constant d > 0. Using two runs of the 
simulation summarized in Table m we have plotted (on the same scale) the two corre- 
sponding posterior distributions for u. In both cases, the posterior is concentrated exactly 
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25 50 100 



A 


3 


4 


5 


3 


4 


5 


3 


4 


5 


PMl 


111 


96 


94 


176 


165 


154 


267 


302 


307 


PM2 


106 


92 


82 


169 


165 


152 


269 


280 


274 


EBM 


103 


96 


93 


166 


177 


174 


271 


312 


319 


PMedl 


129 


83 


73 


205 


149 


130 


255 


279 


283 


PMed2 


125 


86 


68 


187 


148 


129 


273 


254 


245 


EBMed 


110 


81 


72 


162 


148 


142 


255 


294 


300 


HT 


175 


142 


70 


339 


284 


135 


676 


564 


252 


HTO 


136 


92 


84 


206 


159 


139 


306 


261 


245 


EBM 


142 


98 


53 


240 


160 


93 


399 


262 


151 



Table 2: Mean square err ors, based on 100 replications , samp ling X of dimension n = 500 



First eight rows are from iCastillo and van der VaartI (120 12f ) ; last row corresponds to the 



proposed empirical Bayes posterior mean. Boldface font indicates the column winner. 

where we expect that it would be. In particular, when the underlying true signal 6* is 
more sparse, i.e., smaller s„, u is stochastically larger, which means that the posterior 
distribution for 6 has more zeros. 

5 Discussion 

The paper has considered a classical problem of estimating a sparse high-dimensional 
normal mean vector, and we have proposed a novel empirical Bayes solution. Though 
the stated prior itself may seem overly informative, we show that the prior induces a sort 
of shrinkage effect, preventing the posterior from tracking the data too closely. We go 
on to prove that the corresponding empirical Bayes posterior concentrates around 6* at 
the optimal minimax rate, and asymptotic minimaxity of the empirical Bayes posterior 
mean also holds. 

The mathematical device used in our asymptotic analysis is an alt ernative represen- 



tation of the empirical Bayes model with a fractional likelihood. As in I Walker and Hjort 



( 1200 ll ). this fractional likelihood posterior is a powerful tool, though our concentration 
results do not follow immediately from theirs. With this adjustment, the prior changes 
to a very simple one, which we have called IIx. The key to success of our empirical Bayes 
posterior in the asymptotic framework is the particular beta prior on u, under ITx- From 
this prior, and the lower bound derived in Lemma [T|, the minimax rate En = s„ log(n/s„) 
drops out almost automatically. As we indicated, to push through the minimax concen- 
tration result, we only need the conditional prior on 6, given u, under IIx, to not interfere 
with the dynamics induced by the prior on u. Intuitively, there should be many priors 
that would accomplish this. We showed that an empirical Bayes prior that by centering 
a Gaus sian prior at the observations, und er IIx, minimax concentration follows relatively 



easily. ICastillo and van der VaartI (120121 ) have similar results, e.g., they make sure the 

12 



Jll 



0.85 



0.90 
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1.00 
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0.90 
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1.00 



(a) s„ = 50, so 1 - 4f = 0.90 



(b) s„ = 25, so 1 - ^ = 0.95 



Figure 2: Posterior distributions for uo when n = 500 and A = 5 for two values of s„. In 

100. 



each case, n = 0.99, a = 0.25, and a^ 



prior for 6 does not interfere by requiring suitably heavy tails. Their take-away message, 
however, is that with a careful choice of prior for 9, given the support size, proper pos- 
terior concentration holds under some assumptions [their equation (1.2)] on the prior on 
the support size. Our message, on the other hand, is the reverse: with a good choice 
of prior on the weight u, proper posterior concentration holds as long as the conditional 
prior for 6 \ u does not interfere; see Remark |2l 

References 

Abramovich, F., Benjamini, Y., Donoho, D. L., and Johnstone, I. M. (2006), "Adapting 
to unknown sparsity by controlling the false discovery rate," Ann. Statist., 34, 584-653. 

Barron, A., Schervish, M. J., and Wasserman, L. (1999), "The consistency of posterior 
distributions in nonparametric problems," Ann. Statist., 27, 536-561. 

Bhattacharya, A., Pati, D., Pillai, N. S., and Dunson, D. B. (2012), "Bayesian shrinkage," 
Unpublished manuscript, arXiv: 1212.6088. 

Bogdan, M., Chakrabarti, A., Frommlet, F., and Ghosh, J. K. (2011), "Asymptotic 
Bayes-optimality under sparsity of some multiple testing procedures," Ann. Statist., 
39, 1551-1579. 

Brown, L. D. and Greenshtein, E. (2009), "Nonparametric empirical Bayes and compound 
decision approaches to estimation of a high-dimensional vector of normal means," Ann. 
Statist, 37, 1685-1704. 

Cai, T. T. (2012), "Minimax and adaptive inference in nonparametric function estima- 
tion," Statist. Sci., 27, 31-50. 

Cai, T. T. and Jin, J. (2010), "Optimal rates of convergence for estimating the null 
density and proportion of nonnuU effects in large-scale multiple testing," Ann. Statist., 
38, 100-145. 



13 



Cai, T. T., Zhang, C.-H., and Zhou, H. H. (2010), "Optimal rates of convergence for 
covariance matrix estimation," Ann. Statist., 38, 2118-2144. 

Cai, T. T. and Zhou, H. H. (2012), "Optimal rates of convergence for sparse covariance 
matrix estimation," Ann. Statist, 40, 2389-2420. 

Carvalho, C M., Poison, N., and Scott, J. G. (2010), "The horseshoe estimator for sparse 
signals," Biometrika, 97, 465-480. 

Castillo, I. and van der Vaart, A. (2012), "Needles and straw in a haystack: Posterior 
concentration for possibly sparse sequences," Ann. Statist., 40, 2069-2101. 

Donoho, D. L. and Johnstone, I. M. (1994), "Minimax risk over /p-balls for /^-error," 
Probab. Theory Related Fields, 99, 277-303. 

Donoho, D. L., Johnstone, I. M., Hoch, J. C, and Stern, A. S. (1992), "Maximum entropy 
and the nearly black object," J. Roy. Statist. Soc. Ser. B, 54, 41-81, with discussion 
and a reply by the authors. 

Fan, J. and Lv, J. (2010), "A selective overview of variable selection in high dimensional 
feature space," Statist. Sinica, 20, 101-148. 

Ghosal, S., Ghosh, J. K., and Ramamoorthi, R. V. (1999), "Posterior consistency of 
Dirichlet mixtures in density estimation," Ann. Statist., 27, 143-158. 

Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000), "Convergence rates of posterior 
distributions," Ann. Statist., 28, 500-531. 

Jiang, W. and Zhang, C.-H. (2009), "General maximum likelihood empirical Bayes esti- 
mation of normal means," Ann. Statist., 37, 1647-1684. 

Johnstone, I. M. and Silverman, B. W. (2004), "Needles and straw in haystacks: empirical 
Bayes estimates of possibly sparse sequences," Ann. Statist., 32, 1594-1649. 

— (2005), "Empirical Bayes selection of wavelet thresholds," Ann. Statist., 33, 1700-1752. 

Lam, C. and Fan, J. (2009), "Sparsistency and rates of convergence in large covariance 
matrix estimation," Ann. Statist., 37, 4254-4278. 

Park, T. and Casella, G. (2008), "The Bayesian lasso," J. Amer. Statist. Assoc, 103, 
681-686. 

Schwartz, L. (1965), "On Bayes procedures," Z. Wahrs. verw. Geb., 4, 10-26. 

Shen, X. and Wasserman, L. (2001), "Rates of convergence of posterior distributions," 

Ann. Statist, 29, 687-714. 

Tibshirani, R. (1996), "Regression shrinkage and selection via the lasso," J. Roy. Statist. 
Soc. Ser. B, 58, 267-288. 

Walker, S. and Hjort, N. L. (2001), "On Bayesian consistency," J. R. Stat. Soc. Ser. B 
Stat Methodol, 63, 811-821. 

14 



